Heterogeneous parallel 3D image deconvolution on a cluster of GPUs and CPUs

Author

  • L. Domanski

Abstract

This paper presents a heterogeneous computing algorithm for 3D Richardson-Lucy image deconvolution that can run on anything from a single heterogeneous workstation up to a large distributed-memory cluster of many heterogeneous nodes. We demonstrate our solution on a cluster of nodes that each contain multiple CPU cores and GPUs. The algorithm combines message passing and massively multicore programming technologies to achieve nested levels of parallelism, ranging from coarse-grained domain decomposition across worker processes to finer-grained parallelism within worker processes that use GPUs. The work distribution and worker framework are abstracted from the type of processor architecture that each worker process uses for the core algorithm calculations. Allocation of computational resources (different processors or cores) to workers is handled collaboratively by the worker processes on each cluster node using efficient operating-system-level counting semaphores, which avoids the need to manage computational resources centrally on the cluster. The tested implementation uses MPI (Message Passing Interface) for parallelisation across the cluster, CUFFT and custom-written kernels for parallelising algorithm components on the GPU, and the highly tuned MKL math library for computations on the CPU. Results show that using a mix of processor types on the available nodes can provide performance benefits over using a single type alone. Heterogeneous workstations commonly have fewer high-performance accelerators than general-purpose processor cores. In these cases, when weighing the number of cluster nodes used against performance, using all available processors on each node generally either improves performance for the same number of nodes or achieves similar performance with fewer nodes.
We discuss situations where using multiple processor types at once can inhibit performance, and make recommendations on when such an approach would or would not be advantageous.
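
The abstract does not write out the Richardson-Lucy update itself. As a minimal sketch, assuming a non-negative signal and a non-negative point spread function (PSF), each iteration blurs the current estimate with the PSF, divides the observed data by that blurred estimate, correlates the ratio with the PSF, and multiplies the estimate by the result. The version below is 1D, pure Python and uses direct convolution purely for illustration; the paper works on 3D volumes and computes these convolutions with FFTs (CUFFT on the GPU, MKL on the CPU), which this sketch does not attempt. The names `convolve_same` and `richardson_lucy`, the flat initial guess and the `eps` guard are hypothetical choices, not the paper's code.

```python
from typing import List


def convolve_same(signal: List[float], kernel: List[float]) -> List[float]:
    """Direct 1D convolution with zero padding; output has the length of `signal`.

    Assumes an odd-length kernel whose centre is its middle element.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    n, m = len(signal), len(kernel)
    half = m // 2
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for j in range(m):
            k = i + half - j  # signal sample that kernel[j] multiplies at output i
            if 0 <= k < n:  # samples outside the signal count as zero
                acc += signal[k] * kernel[j]
        out[i] = acc
    return out


def richardson_lucy(observed: List[float], psf: List[float],
                    iterations: int = 50, eps: float = 1e-12) -> List[float]:
    """Richardson-Lucy deconvolution of a 1D non-negative signal (illustrative only)."""
    psf_flipped = psf[::-1]  # convolving with the flipped PSF applies its adjoint (correlation)
    mean = sum(observed) / len(observed)
    estimate = [mean] * len(observed)  # flat, non-negative initial guess
    for _ in range(iterations):
        blurred = convolve_same(estimate, psf)  # forward model: estimate blurred by the PSF
        ratio = [d / max(b, eps) for d, b in zip(observed, blurred)]  # eps avoids division by zero
        correction = convolve_same(ratio, psf_flipped)
        estimate = [u * c for u, c in zip(estimate, correction)]  # multiplicative update
    return estimate


if __name__ == "__main__":
    # Blur a single bright point, then try to recover it.
    truth = [0.0] * 4 + [10.0] + [0.0] * 4
    psf = [0.25, 0.5, 0.25]
    observed = convolve_same(truth, psf)
    restored = richardson_lucy(observed, psf, iterations=50)
    print("observed:", [round(v, 2) for v in observed])
    print("restored:", [round(v, 2) for v in restored])
```

Because the update is multiplicative, an estimate that starts non-negative stays non-negative, and as long as the blurred estimate is non-zero wherever the data are, the total intensity of the estimate equals that of the observed data after every iteration.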
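
The abstract mentions coarse-grained domain decomposition across worker processes without describing how the volume is divided. A minimal sketch, assuming the 3D image is cut into contiguous slabs along z and each slab is read with a halo of extra slices on both sides, so that convolutions near its edges see neighbouring data instead of padding; after deconvolving its padded slab, a worker keeps only its core slices. The `Slab` type, the `partition_slabs` function and the `halo` parameter are hypothetical. A halo of half the PSF depth is enough for a single convolution, but running many RL iterations independently on each slab lets boundary effects creep further inward, so a wider overlap (or a halo exchange between iterations) is generally needed.

```python
from typing import List, NamedTuple


class Slab(NamedTuple):
    core_start: int  # first z-slice this worker writes to the output volume
    core_end: int    # one past the last z-slice it writes
    pad_start: int   # first z-slice it reads (core plus halo, clipped to the volume)
    pad_end: int     # one past the last z-slice it reads

    def crop(self) -> slice:
        """Slice that selects the core slices out of this worker's padded result."""
        offset = self.core_start - self.pad_start
        return slice(offset, offset + (self.core_end - self.core_start))


def partition_slabs(nz: int, n_workers: int, halo: int) -> List[Slab]:
    """Split nz z-slices into n_workers contiguous slabs, each padded by `halo` slices."""
    if not 1 <= n_workers <= nz:
        raise ValueError("need 1 <= n_workers <= nz")
    if halo < 0:
        raise ValueError("halo must be non-negative")
    base, extra = divmod(nz, n_workers)
    slabs: List[Slab] = []
    start = 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # the first `extra` workers take one extra slice
        end = start + size
        slabs.append(Slab(start, end, max(0, start - halo), min(nz, end + halo)))
        start = end
    return slabs


if __name__ == "__main__":
    for slab in partition_slabs(nz=10, n_workers=3, halo=2):
        print(slab, "keep", slab.crop())
```

Under MPI, each rank could take the slab at its own rank index. With a mix of GPU and CPU workers, equal-sized slabs are not necessarily the best split, since the processor types run at different speeds.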
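
The abstract states that the workers on each node claim processors collaboratively through operating-system-level counting semaphores rather than through a central scheduler. The sketch below models that idea under stated assumptions: one counting semaphore per processor type on a node, initialised to the number of slots of that type, and a hypothetical policy in which a worker tries GPU slots first and falls back to CPU slots. It uses Python's `threading.Semaphore` so that it runs in one process; separate MPI processes would instead need a semaphore shared between processes, such as a POSIX named semaphore (`sem_open`). `NodeResources`, `try_acquire`, `slot` and `worker` are hypothetical names.

```python
import threading
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class NodeResources:
    """Processor slots on one node, one counting semaphore per processor type."""

    def __init__(self, slots: Dict[str, int], preference: List[str]) -> None:
        # Each semaphore's counter holds the number of free slots of that type.
        self._sems = {kind: threading.Semaphore(count) for kind, count in slots.items()}
        self._preference = list(preference)  # order in which types are tried

    def try_acquire(self) -> Optional[str]:
        """Claim a free slot of the most preferred type, or return None without blocking."""
        for kind in self._preference:
            if self._sems[kind].acquire(blocking=False):
                return kind
        return None

    def release(self, kind: str) -> None:
        """Hand a slot back so another worker can claim it."""
        self._sems[kind].release()

    @contextmanager
    def slot(self, poll_interval: float = 0.01) -> Iterator[str]:
        """Wait (by polling) until some slot is free, hold it for the with-block, then release it."""
        kind = self.try_acquire()
        while kind is None:
            time.sleep(poll_interval)
            kind = self.try_acquire()
        try:
            yield kind
        finally:
            self.release(kind)


def worker(resources: NodeResources, chunk_id: int, log: List[tuple]) -> None:
    with resources.slot() as kind:
        # A real worker would run its chunk on the claimed processor type here.
        log.append((chunk_id, kind))


if __name__ == "__main__":
    # Hypothetical node: 2 GPU slots and 4 CPU worker slots, 8 chunks of work.
    node = NodeResources({"gpu": 2, "cpu": 4}, preference=["gpu", "cpu"])
    log: List[tuple] = []
    threads = [threading.Thread(target=worker, args=(node, i, log)) for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sorted(log))
```

A single semaphore cannot wait on "a slot of any type", which is why the wait loop retries the non-blocking claims. Releasing in a `finally` clause means a worker that fails mid-chunk still returns its processor to the pool.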

Similar resources

Accelerating high-order WENO schemes using two heterogeneous GPUs

A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes and the viscous terms are discretized by the standard fourth-order central scheme. The code, written in the CUDA programming language, is developed by modifying a single-GPU code. The OpenMP library is used for parall...

A Technology of 3D Elastic Wave Propagation Simulation Using Hybrid Supercomputers

We present a technology of 3D seismic field simulation for high-performance computing systems with GPUs or Intel Xeon Phi coprocessors. This technology covers adaptation of a mathematical modeling method and development of a parallel algorithm. We describe the parallel realization designed for simulation based on staggered grids and a 3D domain decomposition method. We study the parallel alg...

Distributed Ray Tracer on GPU

Ray tracing is a method for producing photorealistic 3D computer-generated imagery by modeling the interaction of light rays with a scene. Because each primary ray is independent of the other primary rays being modeled, ray tracing offers a massive degree of parallelism that is well suited to parallel architectures like GPUs, multicore CPUs, and distributed computing environments. Our goal is to implem...

Co-processing SPMD Computation on GPUs and CPUs on Shared Memory System

Heterogeneous parallel systems with multiple processors and accelerators are becoming ubiquitous due to better cost-performance and energy efficiency. These heterogeneous processor architectures have different instruction sets and are optimized for either task latency or throughput. Challenges occur in regard to programmability and performance when executing SPMD computations on heterogene...

Parallel 3D fast wavelet transform on manycore GPUs and multicore CPUs

GPUs have recently attracted our attention as accelerators for a wide variety of algorithms, including assorted examples within the image analysis field. Among them, wavelets are gaining popularity as solid tools for data mining and video compression, though this comes at the expense of a high computational cost. After proving the effectiveness of the GPU for accelerating the 2D Fast Wavelet Tra...

Publication date: 2011